Open In Colab

Problem Statement


Our client is an insurance company that has provided health insurance to its customers. They now need your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in the Vehicle Insurance provided by the company.

An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.

For example, you may pay a premium of Rs. 5000 each year for a health insurance cover of Rs. 200,000/- so that if, God forbid, you fall ill and need to be hospitalised in that year, the insurance provider will bear the cost of hospitalisation etc. for up to Rs. 200,000. Now, if you are wondering how the company can bear such a high hospitalisation cost when it charges a premium of only Rs. 5000/-, that is where the concept of probability comes into the picture. For example, like you, there may be 100 customers paying a premium of Rs. 5000 every year, but only a few of them (say 2-3) would get hospitalised that year, not everyone. This way, everyone shares the risk of everyone else.

Just like medical insurance, there is vehicle insurance, where every year the customer needs to pay a premium of a certain amount to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the insurance provider will pay compensation (called the 'sum assured') to the customer.

Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimise its business model and revenue.

Now, in order to predict whether the customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code), vehicles (vehicle age, damage), policy (premium, sourcing channel), etc.

Attribute Information


  1. id : Unique ID for the customer

  2. Gender : Gender of the customer

  3. Age : Age of the customer

  4. Driving_License : 0 : Customer does not have DL, 1 : Customer already has DL

  5. Region_Code : Unique code for the region of the customer

  6. Previously_Insured : 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

  7. Vehicle_Age : Age of the Vehicle

  8. Vehicle_Damage : 1 : Customer got his/her vehicle damaged in the past, 0 : Customer didn't get his/her vehicle damaged in the past

  9. Annual_Premium : The amount customer needs to pay as premium in the year

  10. PolicySalesChannel : Anonymized code for the channel of reaching out to the customer, i.e. different agents, over mail, over phone, in person, etc.

  11. Vintage : Number of Days, Customer has been associated with the company

  12. Response : 1 : Customer is interested, 0 : Customer is not interested

Introduction


Insurance is an agreement by which an individual obtains protection from an insurance company against the risks of damage, financial loss, illness, or death in return for the payment of a specified premium. In this project, we have an insurance dataset containing 381109 rows and 12 features. We have a categorical dependent variable, Response, which represents whether a customer is interested in vehicle insurance or not. As an initial step, we checked for null and duplicate values in our dataset. As there were none, data cleaning was not required. Further, we normalized the numerical columns to bring them onto the same scale.

In Exploratory Data Analysis, we categorized Age as YoungAge, MiddleAge, and OldAge. Then we categorized Region_Code and Policy_Sales_Channel to extract some valuable information from these features. We explored the independent features using several plots.

For Feature selection, we used Kendall's rank correlation coefficient for numerical features and for categorical features, we applied the Mutual Information technique.

For model prediction, we used supervised machine learning algorithms: Decision Tree Classifier, AdaBoost, LightGBM, Bagging Classifier, Naive Bayes, and Logistic Regression. We then applied hyperparameter tuning techniques to obtain better accuracy and to avoid overfitting.

So, without any further delay let’s move ahead!

Installing Dependencies


In [ ]:
Found existing installation: scikit-learn 0.22.2.post1
Uninstalling scikit-learn-0.22.2.post1:
  Successfully uninstalled scikit-learn-0.22.2.post1
Collecting scikit-learn
  Downloading scikit_learn-1.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.1 MB)
     |████████████████████████████████| 23.1 MB 1.8 MB/s 
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.19.5)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.0.0-py3-none-any.whl (14 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.0 threadpoolctl-3.0.0

Importing Libraries


In [ ]:
In [ ]:
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Reading Dataset


Let's read the dataset we have to work on! We have a dataset of Health Insurance details.
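
The code cell below is empty in this export, so here is a hedged sketch of the loading step. The Drive path in the comment is hypothetical, and a tiny inline CSV with the dataset's 12-column schema stands in for the real file:

```python
import io
import pandas as pd

# Hypothetical Drive path -- adjust to wherever the CSV actually lives:
# df = pd.read_csv("/content/drive/MyDrive/health_insurance.csv")

# Self-contained stand-in: two rows with the dataset's 12-column schema.
csv = io.StringIO(
    "id,Gender,Age,Driving_License,Region_Code,Previously_Insured,"
    "Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response\n"
    "1,Male,44,1,28.0,0,> 2 Years,Yes,40454.0,26.0,217,1\n"
    "2,Male,76,1,3.0,0,1-2 Year,No,33536.0,26.0,183,0\n"
)
df = pd.read_csv(csv)
print(df.shape)  # (rows, columns)
```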

In [ ]:

Data Wrangling


  • Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis.

  • This process typically includes manually converting and mapping data from one raw form into another format to allow for more convenient consumption and organization of the data.

Let's dive into the dataset!

Health Insurance Dataset


Columns:

ID: Unique identifier for the Customer.

Age: Age of the Customer.

Gender: Gender of the Customer.

Driving_License: 0 for customer not having DL, 1 for customer having DL.

Region_Code: Unique code for the region of the customer.

Previously_Insured: 0 for customer not having vehicle insurance, 1 for customer having vehicle insurance.

Vehicle_Age: Age of the vehicle.

Vehicle_Damage: 1 for customer whose vehicle was damaged in the past, 0 for customer whose vehicle wasn't damaged in the past.

Annual_Premium: The amount customer needs to pay as premium in the year.

Policy_Sales_Channel: Anonymized code for the channel of reaching out to the customer, i.e. different agents, over mail, over phone, in person, etc.

Vintage: Number of Days, Customer has been associated with the company.

Response (Dependent Feature): 1 for Customer is interested, 0 for Customer is not interested.

Let's deep dive into the dataset,

In [ ]:
Out[5]:
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response
0 1 Male 44 1 28.0 0 > 2 Years Yes 40454.0 26.0 217 1
1 2 Male 76 1 3.0 0 1-2 Year No 33536.0 26.0 183 0
2 3 Male 47 1 28.0 0 > 2 Years Yes 38294.0 26.0 27 1
3 4 Male 21 1 11.0 1 < 1 Year No 28619.0 152.0 203 0
4 5 Female 29 1 41.0 1 < 1 Year No 27496.0 152.0 39 0
In [ ]:
Out[6]:
id Age Driving_License Region_Code Previously_Insured Annual_Premium Policy_Sales_Channel Vintage Response
count 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000 381109.000000
mean 190555.000000 38.822584 0.997869 26.388807 0.458210 30564.389581 112.034295 154.347397 0.122563
std 110016.836208 15.511611 0.046110 13.229888 0.498251 17213.155057 54.203995 83.671304 0.327936
min 1.000000 20.000000 0.000000 0.000000 0.000000 2630.000000 1.000000 10.000000 0.000000
25% 95278.000000 25.000000 1.000000 15.000000 0.000000 24405.000000 29.000000 82.000000 0.000000
50% 190555.000000 36.000000 1.000000 28.000000 0.000000 31669.000000 133.000000 154.000000 0.000000
75% 285832.000000 49.000000 1.000000 35.000000 1.000000 39400.000000 152.000000 227.000000 0.000000
max 381109.000000 85.000000 1.000000 52.000000 1.000000 540165.000000 163.000000 299.000000 1.000000
In [ ]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    381109 non-null  int64  
 1   Gender                381109 non-null  object 
 2   Age                   381109 non-null  int64  
 3   Driving_License       381109 non-null  int64  
 4   Region_Code           381109 non-null  float64
 5   Previously_Insured    381109 non-null  int64  
 6   Vehicle_Age           381109 non-null  object 
 7   Vehicle_Damage        381109 non-null  object 
 8   Annual_Premium        381109 non-null  float64
 9   Policy_Sales_Channel  381109 non-null  float64
 10  Vintage               381109 non-null  int64  
 11  Response              381109 non-null  int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 34.9+ MB

Checking for Duplicate Data:

In [ ]:
Out[8]:
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response

Checking for Null Values:

In [ ]:
Out[9]:
id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
Response                0
dtype: int64

Observations:


  • As we can see, our data set contains 381109 rows and 12 columns.
  • We do not have any Null Values in our dataset.
  • We have 4 numeric and 5 categorical independent features.
  • Our dependent feature is a categorical column (Response)

Data Cleaning and Refactoring


Let's reformat and clean the data for smooth processing!

Finding Outliers


Let's take a look at the outliers (if any) in our dataset.

In [ ]:
  • From the above plot, it can be inferred that Annual_Premium has a positively skewed distribution.
  • From the above, we can also see that Vintage has an approximately uniform distribution.
  • The Age column has some outliers, but we are not going to treat them because they won't affect our result.
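
The skewness claims above can be checked numerically. A small sketch on stand-in data (synthetic series mimicking the two columns, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in series: premiums are right-skewed, vintage is roughly uniform.
premium = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1000))
vintage = pd.Series(rng.uniform(10, 299, size=1000))

print("Annual_Premium skew:", premium.skew())  # clearly positive (right-skewed)
print("Vintage skew:", vintage.skew())         # close to zero (roughly uniform)
```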

Outlier Treatment and Feature Scaling


  • For outlier treatment, we will apply the quantile method.
  • For feature scaling, we will use the MinMaxScaler technique for normalization.
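
The two steps above can be sketched as follows; the 5th/95th-percentile cut-offs are an assumption, since the notebook's exact quantiles are not shown:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
df = pd.DataFrame({"Annual_Premium": rng.lognormal(10, 0.6, 500)})

# Quantile treatment: clip everything outside the 5th-95th percentile range.
lo, hi = df["Annual_Premium"].quantile([0.05, 0.95])
df["Annual_Premium_Treated"] = df["Annual_Premium"].clip(lower=lo, upper=hi)

# Normalization: rescale the treated column to the [0, 1] range.
scaler = MinMaxScaler()
df["Annual_Premium_Treated"] = scaler.fit_transform(df[["Annual_Premium_Treated"]])
print(df["Annual_Premium_Treated"].between(0, 1).all())
```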
In [ ]:
In [ ]:
  • From the above plots we can see that there are no more outliers in Annual Premium.

Exploratory Data Analysis


In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:

Exploring the Numerical Features


We have 4 numerical features: Age, Policy_Sales_Channel, Region_Code, Vintage. Without any further delay, let's explore these features.

In [ ]:
In [ ]:

From the above graphical representation we can conclude on a few points:

  • As we can see, there is a huge dispersion of data in the Age feature, so in order to gain better insights on it, we can convert it into the categories YoungAge, MiddleAge and OldAge.
  • Similarly, we can also categorize Region_Code and Policy_Sales_Channel.
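
A sketch of the Age binning with pandas; the cut-points 45 and 60 are inferred from the sample rows shown later (44 → YoungAge, 47 → MiddleAge, 76 → OldAge) and may differ from the notebook's actual boundaries:

```python
import pandas as pd

ages = pd.Series([21, 29, 44, 47, 76], name="Age")

# Hypothetical cut-points -- inferred, not taken from the original code.
age_group = pd.cut(
    ages,
    bins=[0, 45, 60, 100],
    labels=["YoungAge", "MiddleAge", "OldAge"],
)
print(age_group.tolist())
```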

Converting Numerical Columns to Categorical


In [ ]:

Observations:

  • We can see that customers belonging to the YoungAge group are less likely to be interested in taking vehicle insurance.
  • Similarly, Region_C and Channel_A customers have the highest chances of not taking vehicle insurance.
In [ ]:
Out[33]:
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vehicle_Damage Annual_Premium Policy_Sales_Channel Vintage Response Annual_Premium_Treated Vintage_Treated Age_Group Policy_Sales_Channel_Categorical Region_Code_Categorical
0 1 Male 44 1 28.0 0 > 2 Years Yes 40454.0 26.0 217 1 0.638245 0.716263 YoungAge Channel_B Region_A
1 2 Male 76 1 3.0 0 1-2 Year No 33536.0 26.0 183 0 0.521510 0.598616 OldAge Channel_B Region_C
2 3 Male 47 1 28.0 0 > 2 Years Yes 38294.0 26.0 27 1 0.601797 0.058824 MiddleAge Channel_B Region_A
3 4 Male 21 1 11.0 1 < 1 Year No 28619.0 152.0 203 0 0.438540 0.667820 YoungAge Channel_A Region_C
4 5 Female 29 1 41.0 1 < 1 Year No 27496.0 152.0 39 0 0.419591 0.100346 YoungAge Channel_A Region_B

Gender Distribution


In [ ]:
  • From the above plot, we can say that the number of male customers in our dataset is higher than the number of female customers.

Exploring the Age Feature


In [ ]:

Observation:

  • From the first plot, we can see the responses received from the different age groups.
  • The second plot shows the number of customers in each age group having or not having vehicle insurance.
  • We can say that YoungAge and OldAge customers are equally likely to have or not have vehicle insurance, whereas MiddleAge customers have the highest chance of not being previously insured.
  • From the third plot, we can see the relation between Age and Annual_Premium for both male and female customers.

Exploring Vehicle Damage


In [ ]:
In [ ]:

Observations:

  • The pie plot shows the number of customers whose vehicles were damaged/not damaged and who took the insurance.
  • From the first point plot, we can say that the chance of taking vehicle insurance is higher if the vehicle is damaged, irrespective of vehicle-age group. With an increase in vehicle age, the chance of taking vehicle insurance also increases.
  • The second point plot shows that Annual_Premium is comparatively higher for customers with damaged vehicles.

Exploring Vehicle Age Feature


In [ ]:
In [ ]:

Observations:

  • From the first bar plot, we can see the number of customers in each vehicle-age group who took/didn't take vehicle insurance.
  • The first two plots of the grid show the probability of taking vehicle insurance for a particular vehicle-age group.
  • The third plot of the grid shows the probability of taking vehicle insurance for a particular vehicle-age group based on Region_Code.
  • The fourth plot of the grid shows the probability of taking vehicle insurance for a particular vehicle-age group based on Policy_Sales_Channel.
  • From the box plot of the grid, we can see the relation between vehicle-age group and Annual_Premium based on the Vehicle_Damage response.
  • The strip plot shows that customers with vehicle age > 2 years have a higher chance of taking vehicle insurance.

Exploring Annual Premium


In [ ]:

Observations:

  • From the point plot, we can say that customers with a higher Annual_Premium are more likely to take vehicle insurance.
  • The second (violin) plot shows the same thing.
  • The third plot shows the pattern of responses based on Annual_Premium.
  • The fourth plot is the strip plot for Annual_Premium and Response.

Annual Premium and Age


In [ ]:
  • The above two plots, bar and violin, show the distribution of Annual_Premium on the basis of Age_Group.
In [ ]:

Observations:

  • The first plot shows the Annual_Premium of customers based on their Age.
  • The second plot shows the same, but with the data points categorized by Region_Code.

Age Group


In [ ]:
In [ ]:

Observations:

  • The above three pie plots show the distribution of Age_Group in the dataset based on Response, Annual_Premium and Previously_Insured.
  • The above two pie plots show the distribution of Region_Code in the dataset based on Vintage and Annual_Premium.

Exploring Policy Sales Channel


In [ ]:

Observations:

  • The two point plots show the distribution of Policy_Sales_Channel based on Vintage and Annual_Premium_Treated.
  • The next three bar plots show the number of data points belonging to a particular channel based on Response.
  • The last bar plot shows the probability of a customer taking vehicle insurance based on Policy_Sales_Channel and Region_Code.

Distribution Plots based on Features


  • The below plots show the distribution of data points based on different features.
In [ ]:

Dropping Extra Columns


  • As we have already categorized the 'Age', 'Region_Code', 'Annual_Premium', 'Policy_Sales_Channel' and 'Vintage' features in our dataset, we can now drop them.
  • We can also drop 'id' and 'Driving_License' as they do not provide any valuable information.
In [ ]:
Out[47]:
Index(['id', 'Gender', 'Age', 'Driving_License', 'Region_Code',
       'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium',
       'Policy_Sales_Channel', 'Vintage', 'Response', 'Annual_Premium_Treated',
       'Vintage_Treated', 'Age_Group', 'Policy_Sales_Channel_Categorical',
       'Region_Code_Categorical'],
      dtype='object')
In [ ]:

Feature Selection


Numeric Feature Selection

Let's see the Kendall's correlation between numerical features.
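
A minimal sketch of this step; the values below are stand-ins, not the real columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Annual_Premium_Treated": [0.64, 0.52, 0.60, 0.44, 0.42],
    "Vintage_Treated": [0.72, 0.60, 0.06, 0.67, 0.10],
})

# Kendall's tau between every pair of numeric columns.
corr = df.corr(method="kendall")
print(corr)
```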

In [ ]:

We have two numeric features: Annual_Premium_Treated and Vintage_Treated.

  • There is no correlation between these two features, so we are going to move forward with both of them.

Categorical Features

Let's see the feature importance of categorical features.

In [ ]:
In [ ]:

Mutual Information

Mutual information is one of many quantities that measure how much one random variable tells us about another.
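
A sketch of this step using scikit-learn's mutual_info_classif on synthetic stand-in features (one informative, one pure noise):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
# Stand-in integer-encoded categorical features.
previously_insured = rng.integers(0, 2, n)
noise = rng.integers(0, 4, n)
# Response depends almost entirely on the first feature (5% label noise).
y = (previously_insured ^ (rng.random(n) < 0.05)).astype(int)

X = np.column_stack([previously_insured, noise])
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print(mi)  # the first score clearly dominates
```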

In [ ]:
  • From the above bar plot, we can conclude that Previously_Insured is the most important feature and has the highest impact on the dependent feature.

One-Hot Encoding


One-hot encoding is a process by which categorical variables are converted into a form that can be provided to ML algorithms to improve prediction.

When there is no ordinal relationship between categories, we use one-hot encoding. With one-hot encoding, the model doesn't assume a natural ordering between categories, which could otherwise result in poor performance or unexpected results.
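
A sketch of the encoding step with pandas get_dummies on a few stand-in rows:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female"],
    "Vehicle_Damage": ["Yes", "No", "No"],
    "Response": [1, 0, 0],
})

# Each categorical column becomes one indicator column per category.
encoded = pd.get_dummies(df, columns=["Gender", "Vehicle_Damage"])
print(encoded.columns.tolist())
```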

In [ ]:
Out[53]:
Index(['Gender', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage',
       'Response', 'Annual_Premium_Treated', 'Vintage_Treated', 'Age_Group',
       'Policy_Sales_Channel_Categorical', 'Region_Code_Categorical'],
      dtype='object')
In [ ]:
Out[54]:
Response Annual_Premium_Treated Vintage_Treated Gender_Female Gender_Male Previously_Insured_0 Previously_Insured_1 Vehicle_Age_1-2 Year Vehicle_Age_< 1 Year Vehicle_Age_> 2 Years Vehicle_Damage_No Vehicle_Damage_Yes Age_Group_MiddleAge Age_Group_OldAge Age_Group_YoungAge Policy_Sales_Channel_Categorical_Channel_A Policy_Sales_Channel_Categorical_Channel_B Policy_Sales_Channel_Categorical_Channel_C Policy_Sales_Channel_Categorical_Channel_D Region_Code_Categorical_Region_A Region_Code_Categorical_Region_B Region_Code_Categorical_Region_C
0 1 0.638245 0.716263 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0
1 0 0.521510 0.598616 0 1 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1
2 1 0.601797 0.058824 0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0
3 0 0.438540 0.667820 0 1 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 0 1
4 0 0.419591 0.100346 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0

So, here we are done with the Feature Selection part of our dataset. Let's train the dataset on different Machine Learning Algorithms.

Machine Learning Algorithms


Let's apply different machine learning models to our dataset and see how each of them performs. First, we will tune the hyper-parameters of those models; then we will compare and choose the best model among them, based on elapsed time and the evaluation metrics of the best parameters.

List of Machine Learning Models we are going to train and evaluate our data set on:

  • Decision Tree
  • Gaussian Naive Bayes
  • AdaBoost Classifier
  • Bagging Classifier
  • LightGBM
  • Logistic Regression

Hyper-Parameter Tuning Methods:

We have tried different hyper-parameter tuning methods. Every method gave the same result, but GridSearchCV and RandomizedSearchCV took a huge amount of time to train the models. HalvingRandomizedSearchCV took the least time to train the models and predict the output. That's why we highly recommend keeping the Tuning_Method as Halving_Randomized_Search_CV in the drop-down menu below.

We have also added some results of the model tuning with GridSearchCV and RandomizedSearchCV, just for performance comparison.

Tuning Methods:

  • HalvingRandomizedSearchCV
  • GridSearchCV
  • RandomizedSearchCV
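
scikit-learn ships the halving search as HalvingRandomSearchCV behind an experimental opt-in import; a minimal sketch on synthetic data (the parameter grid here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (required opt-in)
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# Candidate hyper-parameters; successive halving prunes bad ones early.
param_distributions = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10],
}
search = HalvingRandomSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```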

Evaluation Metrics:

  • Accuracy Score
  • Precision
  • Recall
  • F1 Score
  • ROC AUC Score
  • Log Loss
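
All six metrics can be computed with scikit-learn; a small sketch on made-up predictions:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.1, 0.2, 0.9, 0.4, 0.3, 0.8, 0.6, 0.2]  # predicted P(Response=1)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
print("Log loss :", log_loss(y_true, y_prob))
```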

Plots:

At the end of every model's hyper-parameter tuning, there is an ROC curve showing the ROC scores, and a parallel coordinates plot showing all the combinations of hyper-parameters used for tuning the model to get the best parameters.

Let's get started...!

In [ ]:
In [ ]:

Comparison Between Different Tuning Techniques:

GridSearchCV:


RandomizedSearchCV:


HalvingSearchCV:


Decision Tree


The decision tree is one of the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label.

Hyper-Parameter Tuning:

splitter: The strategy used to choose the split at each node.

max_depth: The maximum depth of the tree.

min_samples_leaf: The minimum number of samples required to be at a leaf node.

min_weight_fraction_leaf: The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.

max_features: The number of features to consider when looking for the best split.

max_leaf_nodes: Grow a tree with max_leaf_nodes in best-first fashion.

random_state: Controls the randomness of the estimator.
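
A hedged sketch of a decision tree using the knobs listed above; the parameter values here are illustrative, not the tuned ones reported below:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (~88% negative class, like the Response column).
X, y = make_classification(n_samples=500, n_features=10, weights=[0.88], random_state=0)

clf = DecisionTreeClassifier(
    splitter="best",
    max_depth=5,
    min_samples_leaf=5,
    min_weight_fraction_leaf=0.0,
    max_features="sqrt",
    max_leaf_nodes=40,
    random_state=23,
)
clf.fit(X, y)
print(clf.get_depth(), clf.score(X, y))
```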


In [ ]:
################################################################
     <<<< Tuning Model: Halving_Randomized_Search_CV >>>>
****************************************************************
--------------------------------------------------
DecisionTreeClassifier
--------------------------------------------------

Evaluation of DecisionTreeClassifier before tuning:
--------------------------------------------------
   Accuracy_Score  Precision    Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.824994    0.27807  0.273458  0.275745       0.587483  6.044574

**************************************************
Best Score for DecisionTreeClassifier : 0.8763718296727145 
---
Best Parameters for DecisionTreeClassifier : {'splitter': 'random', 'random_state': 23, 'min_weight_fraction_leaf': 0.5, 'min_samples_leaf': 5, 'max_leaf_nodes': 40, 'max_features': 'sqrt', 'max_depth': 5}
--------------------------------------------------
Elapsed Time: 00:04:19
==============================

Evaluation of DecisionTreeClassifier after tuning:
--------------------------------------------------
   Accuracy_Score  Precision  Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.878172        0.0     0.0       0.0            0.5  4.207802

Gaussian Naive Bayes


Gaussian Naive Bayes is a variant of Naive Bayes that assumes a Gaussian (normal) distribution and supports continuous data. Naive Bayes classifiers are a group of supervised machine learning classification algorithms based on Bayes' theorem. It is a simple classification technique, but it has high functionality.

Hyper-Parameter Tuning:

var_smoothing: Portion of the largest variance of all features that is added to variances for calculation stability.
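
A minimal sketch on synthetic two-class Gaussian data; the var_smoothing value is illustrative:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two Gaussian clusters, one per class.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# var_smoothing adds a fraction of the largest feature variance to every
# variance estimate, which stabilises the Gaussian likelihoods.
model = GaussianNB(var_smoothing=1e-9).fit(X, y)
print(model.score(X, y))
```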


In [ ]:
################################################################
     <<<< Tuning Model: Halving_Randomized_Search_CV >>>>
****************************************************************
--------------------------------------------------
GaussianNB
--------------------------------------------------

Evaluation of GaussianNB before tuning:
--------------------------------------------------
   Accuracy_Score  Precision    Recall  F1_Score  ROC_AUC_Score   Log_Loss
0        0.687571   0.268878  0.910044   0.41511       0.783375  10.791173

**************************************************
Best Score for GaussianNB : 0.692433971639338 
---
Best Parameters for GaussianNB : {'var_smoothing': 0.1873817422860384}
--------------------------------------------------
Elapsed Time: 00:00:09
==============================

Evaluation of GaussianNB after tuning:
--------------------------------------------------
   Accuracy_Score  Precision    Recall  F1_Score  ROC_AUC_Score   Log_Loss
0        0.689337   0.269544  0.906454  0.415527       0.782835  10.730149

AdaBoost Classifier


AdaBoost algorithm, short for Adaptive Boosting, is a Boosting technique used as an Ensemble Method in Machine Learning. It is called Adaptive Boosting as the weights are re-assigned to each instance, with higher weights assigned to incorrectly classified instances.

Hyper-Parameter Tuning:

n_estimators: The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.

learning_rate: Weight applied to each classifier at each boosting iteration.

random_state: Controls the randomness of the estimator.
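
A sketch with the listed hyper-parameters on synthetic data (the values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 10 boosting rounds with a small learning rate; each round re-weights
# the misclassified samples before fitting the next weak learner.
model = AdaBoostClassifier(n_estimators=10, learning_rate=0.01, random_state=2)
model.fit(X, y)
print(len(model.estimators_), model.score(X, y))
```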


In [ ]:
################################################################
     <<<< Tuning Model: Halving_Randomized_Search_CV >>>>
****************************************************************
--------------------------------------------------
AdaBoostClassifier
--------------------------------------------------

Evaluation of AdaBoostClassifier before tuning:
--------------------------------------------------
   Accuracy_Score  Precision  Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.878172        0.0     0.0       0.0            0.5  4.207802

**************************************************
Best Score for AdaBoostClassifier : 0.8633333333333333 
---
Best Parameters for AdaBoostClassifier : {'random_state': 2, 'n_estimators': 10, 'learning_rate': 0.01}
--------------------------------------------------
Elapsed Time: 00:00:29
==============================

Evaluation of AdaBoostClassifier after tuning:
--------------------------------------------------
   Accuracy_Score  Precision  Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.878172        0.0     0.0       0.0            0.5  4.207802

Bagging Classifier


A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.

Hyper-Parameter Tuning:

n_estimators: The maximum number of estimators at which boosting is terminated.

random_state: Controls the randomness of the estimator.
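
A sketch on synthetic data using the two knobs listed above (illustrative values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 10 base estimators (decision trees by default), each fit on a bootstrap sample;
# predictions are combined by majority vote.
model = BaggingClassifier(n_estimators=10, random_state=26).fit(X, y)
print(model.score(X, y))
```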


In [ ]:
################################################################
     <<<< Tuning Model: Halving_Randomized_Search_CV >>>>
****************************************************************
--------------------------------------------------
BaggingClassifier
--------------------------------------------------

Evaluation of BaggingClassifier before tuning:
--------------------------------------------------
   Accuracy_Score  Precision    Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.853647    0.30408  0.156221  0.206403       0.553311  5.054895

**************************************************
Best Score for BaggingClassifier : 0.8727272727272728 
---
Best Parameters for BaggingClassifier : {'random_state': 26, 'n_estimators': 10}
--------------------------------------------------
Elapsed Time: 00:00:20
==============================

Evaluation of BaggingClassifier after tuning:
--------------------------------------------------
   Accuracy_Score  Precision   Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.853349   0.303517  0.15737  0.207272       0.553636  5.065167

LightGBM Classifier


LightGBM, short for Light Gradient Boosting Machine, is a distributed gradient boosting framework. It uses histogram-based splitting, Gradient-based One-Side Sampling (GOSS), and Exclusive Feature Bundling (EFB), making it a fast algorithm.

Hyper-Parameter Tuning:

n_estimators: Number of Boosting iterations.

learning_rate: This setting is used for reducing the gradient step. It affects the overall time of training: the smaller the value, the more iterations are required for training.

min_data_in_leaf: Minimal number of data points in one leaf. Can be used to deal with over-fitting.

random_state: Controls the randomness of the estimator.


In [ ]:
################################################################
     <<<< Tuning Model: Halving_Randomized_Search_CV >>>>
****************************************************************
--------------------------------------------------
LGBMClassifier
--------------------------------------------------

Evaluation of LGBMClassifier before tuning:
--------------------------------------------------
   Accuracy_Score  Precision    Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.878215   0.545455  0.002154  0.004291       0.500952  4.206292

**************************************************
Best Score for LGBMClassifier : 0.8749999411774395 
---
Best Parameters for LGBMClassifier : {'n_estimators': 100, 'min_data_in_leaf': 250, 'max_depths': 3.0, 'learning_rate': 0.001}
--------------------------------------------------
Elapsed Time: 00:03:35
==============================

Evaluation of LGBMClassifier after tuning:
--------------------------------------------------
   Accuracy_Score  Precision  Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.878172        0.0     0.0       0.0            0.5  4.207802

Logistic Regression


The logistic classification model is a binary classification model in which the conditional probability of one of the two possible realizations of the output variable is assumed to be equal to a linear combination of the input variables, transformed by the logistic function.

Hyper-Parameter Tuning:

solver: Algorithm to use in the optimization problem.

penalty: Specify the norm of the penalty.

C: Inverse of regularization strength

random_state: Controls the randomness of the estimator.
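
A sketch using the tuned setting reported below (sag solver, L2 penalty, C=0.001) on synthetic data; max_iter is raised here so sag converges:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# Small C means strong L2 regularisation; sag is a fast solver for large datasets.
model = LogisticRegression(solver="sag", penalty="l2", C=0.001,
                           random_state=2, max_iter=1000)
model.fit(X, y)
print(model.score(X, y))
```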


In [ ]:
################################################################
     <<<< Tuning Model: Halving_Randomized_Search_CV >>>>
****************************************************************
--------------------------------------------------
LogisticRegression
--------------------------------------------------

Evaluation of LogisticRegression before tuning:
--------------------------------------------------
   Accuracy_Score  Precision  Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.878172        0.0     0.0       0.0            0.5  4.207802

**************************************************
Best Score for LogisticRegression : 0.8750259605399793 
---
Best Parameters for LogisticRegression : {'solver': 'sag', 'random_state': 2, 'penalty': 'l2', 'C': 0.001}
--------------------------------------------------
Elapsed Time: 00:00:06
==============================

Evaluation of LogisticRegression after tuning:
--------------------------------------------------
   Accuracy_Score  Precision  Recall  F1_Score  ROC_AUC_Score  Log_Loss
0        0.878172        0.0     0.0       0.0            0.5  4.207802

Best Model


From all the models we trained above, we can conclude that the Bagging Classifier is the best model for our dataset. The best parameter of this model is {'n_estimators': 200}. Its accuracy score is 0.85, precision 0.31, recall 0.15, F1 score 0.20, ROC AUC score 0.55 and log loss 4.98. Its elapsed time is 3 minutes and 21 seconds.

We can see that other models have a higher accuracy score than the Bagging Classifier. But the problem with those models is that their precision and recall values are zero, which means their true positives are zero: those models never correctly identify a customer who is ready to take vehicle insurance. And as we all know, classification accuracy alone can be misleading when the classes are imbalanced, which is exactly the case with our dataset.

Hence, Bagging Classifier is the best model for our data set.

NOTE: You might get slightly different results every time you run this, because we are using Halving_Randomized_Search_CV to perform hyperparameter tuning, which randomly selects the combinations of parameters used to tune the model.

Extracting Feature Importance


We got our best model with its hyper-parameter values. Let's have a look at the feature importance of our data set.

In [ ]:
In [ ]:

Observations:

  • Annual_Premium_Treated has the most impact on the prediction.
  • Gender_Male has the highest feature weight but a lower cumulative weight.

Conclusion


Starting from loading our dataset, we initially checked for null and duplicate values. There were none, so no treatment was required. Before modeling, we applied feature scaling to normalize our data, bringing all features onto the same scale and making them easier for ML algorithms to process.

Through Exploratory Data Analysis, we categorized Age as YoungAge, MiddleAge, and OldAge; Region_Code as Region_A, Region_B, and Region_C; and Policy_Sales_Channel as Channel_A, Channel_B, and Channel_C. Further, we observed that customers belonging to the YoungAge group are less likely to be interested in vehicle insurance. We observed that customers with vehicles older than 2 years are more likely to be interested in vehicle insurance. Similarly, customers with damaged vehicles are more likely to be interested in vehicle insurance.

For Feature Selection, we used Kendall's rank correlation coefficient for numerical features, and for categorical features we applied the Mutual Information technique. Here we observed that Previously_Insured is the most important feature with the highest impact on the dependent feature, and that there is no correlation between the two numeric features.

Further, we applied machine learning algorithms to determine whether a customer would be interested in Vehicle Insurance. For the Naive Bayes algorithm, we got an accuracy score of 68%, and after hyperparameter tuning, the accuracy score increased to 72%. Similarly, for the Decision Tree Classifier, AdaBoost, Bagging Classifier and LightGBM, accuracy scores of around 82%-87% were obtained. So we selected as our best model the one with an accuracy score of 85%, taking precision and recall into account, since we have an unequal number of observations in each class and accuracy alone can be misleading.

That’s it! We reached the end.